Compound terms and their constituent elements in information retrieval

Author

  • Jussi Karlgren
Abstract

Compounds, especially in languages where compounds are formed by concatenation without intervening whitespace between elements, pose challenges to simple text retrieval algorithms. Search queries that include compounds may not retrieve texts where elements of those compounds occur in uncompounded form; search queries that lack compounds will not retrieve texts where the salient elements are buried inside compounds. This study explores the distributional characteristics of compounds and their constituent elements using Swedish, a compounding language, as a test case. The compounds studied are taken from experimental search topics given for CLEF, the Cross-Language Evaluation Forum, and their distributions are related to relevance assessments made on the collection under study and evaluated in terms of divergence from expected random distribution over documents. The observations made have direct ramifications for, e.g., query analysis and term weighting approaches in information retrieval system design.

1 What is a compound?

Compounding is one of the basic methods of word-formation in human language. Two or more base word elements which typically occur independently as single words are juxtaposed to form a more complex entity. The compound elements can be concatenated without space, joined with a hyphen, or form an open compound with white space in between: “classroom”, “cross-lingual”, “high school”. Compounding is a productive process: new compounds can be formed on the fly for ad-hoc purposes to treat topical elements in the discourse at hand. The semantics of a compound is typically related to the constituent elements, and most often the former constituent modifies the latter. Compounding has been studied in detail, although not always in terms of function, by linguists, terminologists, grammarians, and lexicologists over the past years; there are excellent overviews available for almost any language one might be interested in.
Compounding processes may show great surface differences between languages. Some languages use script systems that make no discernible difference between compounds and happenstance or syntactically motivated juxtaposition: ideogram-based Asian scripts, such as Japanese or Chinese, for example. Some languages show a preponderance of open compounds and are restrictive in forming new closed compounds, such as the English language (see Quirk et al. (1985) for a comprehensive treatment of English compounding). Other languages again, such as Swedish, a near relation of English both in terms of cultural and linguistic history, tend towards closed compounds, with no white space between elements (see Noréen (1906) for a comprehensive treatment of Swedish compounding).

Compounds that originally are formed on the fly are eventually lexicalized and gain status as terms in their own right in a language. Terms such as “staircase”, “blackbird” or “doorjamb” are not dynamically constructed for the purpose of a single discourse session or a single text – they are single simple words from the perspective of the language user. Compounds can also be borrowed from and loaned to other languages and then often lose their character as a composite term: “homicide” is only in some sense an English compound. Other types of derivation, such as affixation, also resemble compounding to the extent that it may be difficult to draw a line. Is “eatable” a compound of “eat” and “able”? These types of process make compound analysis a demanding task for language engineering applications. When is it motivated to segment a compound to make it understandable, and when should it be understood to be a lexical item in its own right?

Proceedings of the 15th NODALIDA conference, Joensuu 2005 Ling@JoY 1, 2006

2 Compounds in information
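The retrieval problem stated in the abstract can be illustrated with a small sketch. It is not the method used in this study: the two toy Swedish documents, the two-entry lexicon and the naive two-part splitter are all assumptions made purely for illustration, with hypothetical names throughout. The sketch shows exact token matching missing documents in both directions, and a lexicon-based split recovering them.

```python
# Toy collection (hypothetical): "bokhylla" (bookshelf) = "bok" (book) + "hylla" (shelf).
DOCS = {
    "d1": "en ny bokhylla",
    "d2": "en hylla för varje bok",
}

# Hypothetical lexicon of known simple words; a real system would need a full one.
LEXICON = frozenset({"bok", "hylla"})


def split_compound(word, lexicon):
    """Return [head, tail] if the word splits into two lexicon entries, else [word]."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [word]


def terms(text, lexicon):
    """Lowercase, split on whitespace, and replace splittable compounds by their parts."""
    result = set()
    for token in text.lower().split():
        result.update(split_compound(token, lexicon))
    return result


def search(query, docs, lexicon=frozenset()):
    """Return sorted ids of documents that contain every query term.

    With the default empty lexicon nothing is split, so this is plain
    exact token matching.
    """
    query_terms = terms(query, lexicon)
    return sorted(
        doc_id for doc_id, text in docs.items()
        if query_terms <= terms(text, lexicon)
    )
```

Without a lexicon, `search("bokhylla", DOCS)` finds only `d1` and `search("bok hylla", DOCS)` finds only `d2`; with `LEXICON` passed in, both queries find both documents. The sketch splits every compound it can, which is exactly the decision the closing question of section 1 puts in doubt: a lexicalized compound may be better treated as a single term.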

Similar resources

Using Interactive Search Elements in Digital Libraries

Background and Aim: Interaction in a digital library helps users locate and access information and also assists them in creating knowledge, better perception, problem solving and recognition of the dimensions of resources. This paper tries to identify and introduce the components and elements that are used in interaction between user and system in search and retrieval of information in digital li...

Investigating the role of types of homograph contexts in determining similarity between documents

Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation, thereby improving the effectiveness of the results retrieved. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided into different types, wit...

The Role of the FUM Students' Demographic Features in the Relevance Judgment Scores of Their Information Retrieval Results in Search Engines

In order to design user-friendly information retrieval systems, it is important to pay attention to characteristics of users. Therefore, the aim of the present study is to investigate the role of demographic variables of users during their search in search engines. Method: This is an applied study in terms of purpose, which was done by the evaluation method. To conduct the research, firstly,...

The automatic building and expression of complex concepts: the generation of novel compound nominals to express the 'aboutness' concepts of a text

The work in progress described here concerns the problem of generating compound nominals (such as 'electronic games industry growth', 'electronic games company advertising budgets') in an appropriate context. Past approaches to the problems presented to linguists by compound nominals (CNs) have had limited success. This report presents a new way of looking at CNs, with the emphasis on their con...

Designing a health information management model for nursing care facilities in Iran, 1385

Introduction: Nursing care facilities are among a variety of health care services. Nursing care facilities refer to a broad spectrum of health, social, supportive, medical and rehabilitation care. People who live in these facilities can choose their services. Thus, nursing care facilities need professional organization and standards for health information management. Methods: This is a...

Publication date: 2005